This dataset has 27 columns and 641,138 rows. Each row holds the summary statistics for one user in one course. The dataset includes 16 courses. Three of those are the same course material offered at two different times. There are 476,549 unique registrants across all courses. Detailed descriptions of this dataset can be found in edx data summary.md.
## [1] 641138 28
## [1] 476549 9
We are interested in the backgrounds and activities of the students and how these factors affect their performance in a course. The features in this dataset that we focus on are registered, explored, LoE_DI, YoB, gender, and nevents. The responses that we are interested in are certified and grade.
In this analysis we use nevents as the main measure of how much effort a student puts into a course. Other features such as ndays_act, start_time_DI, last_event_DI and nforum_posts are occasionally used to support this investigation.
The following new variables were created in the original dataframe edxdata:
age: the age of the user when taking the course, calculated as 2013 - YoB.
access.period: number of days between start_time_DI and last_event_DI.
access.rate: ndays_act divided by access.period. This variable measures how often a user accesses the course.
Also, we created the following new datasets by grouping certain features in the original dataset:
This dataset is created by grouping the raw dataset by userid_DI. The following new variables were created:
total_registered: number of courses registered.
total_explored: number of courses explored.
user.certificates: number of courses certified.
This dataset is created by grouping the raw dataset by course_id. The following new variables were created:
passed_num: total number of certified users of the course.
explored_num: total number of users who explored the course.
registered_num: total number of users who registered for the course.
total_nforum_posts: total number of forum posts in the course.
pass.rate: the number of certified users divided by the number of registered users.
hangon.rate: the number of explored users divided by the number of registered users.
To the best of our knowledge, we did not find any unusual distributions.
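The derived variables and grouped datasets described above could be computed along the following lines. This is a minimal base R sketch on a made-up four-row stand-in for edxdata. The toy values, the names `by_user` and `by_course`, and the assumptions that registered, explored and certified are coded 0/1 and that the two date columns are already of class Date are illustrative and not taken from the original analysis.

```r
# Toy stand-in for edxdata (all values invented for illustration).
edxdata <- data.frame(
  course_id     = c("A", "A", "B", "B"),
  userid_DI     = c("u1", "u2", "u1", "u3"),
  registered    = c(1, 1, 1, 1),
  explored      = c(1, 0, 1, 1),
  certified     = c(1, 0, 0, 1),
  YoB           = c(1990, 1985, 1990, NA),
  ndays_act     = c(10, 1, 4, 20),
  nforum_posts  = c(2, 0, 0, 5),
  start_time_DI = as.Date(c("2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01")),
  last_event_DI = as.Date(c("2013-01-21", "2013-01-02", "2013-02-09", "2013-03-13")),
  stringsAsFactors = FALSE
)

# Row-level derived variables.
edxdata$age           <- 2013 - edxdata$YoB
edxdata$access.period <- as.numeric(difftime(edxdata$last_event_DI,
                                             edxdata$start_time_DI,
                                             units = "days"))
edxdata$access.rate   <- edxdata$ndays_act / edxdata$access.period

# Per-user totals (grouping by userid_DI).
by_user <- aggregate(
  cbind(total_registered = registered, total_explored = explored,
        user.certificates = certified) ~ userid_DI,
  data = edxdata, FUN = sum
)

# Per-course totals (grouping by course_id), then the two rates.
by_course <- aggregate(
  cbind(registered_num = registered, explored_num = explored,
        passed_num = certified, total_nforum_posts = nforum_posts) ~ course_id,
  data = edxdata, FUN = sum
)
by_course$pass.rate   <- by_course$passed_num / by_course$registered_num
by_course$hangon.rate <- by_course$explored_num / by_course$registered_num
```

Rows with an access.period of zero would yield an infinite or undefined access.rate, so real data may need those rows handled separately.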
We also preprocessed a number of features of the raw dataset:
LoE_DI: converted into factor levels.
certified, explored and viewed: converted into the logical data type.
First, we investigate some basic user statistics of each course:
pass.rate of each course:
From the above analysis, we can see that CS50X (Introduction to Computer Science I) is the most popular course in terms of the number of registrants. We can also see that the pass.rate of every course is only a few percent.
We then investigate the statistics of the users:
Please note that the y-axis of the above graph is in log scale.
age among all registrants:
Although the distribution of age is very wide, most registrants are between 20 and 35 years old.
LoE_DI of all registrants, with NA and blank (“”) filtered:
LoE_DI is dominated by Less than Secondary, Master’s and Doctorate.
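The filtering and level conversion of LoE_DI can be sketched as follows. This is an illustration in base R on invented values; the ordering of the levels from lowest to highest is an assumption, and the conversion of a 0/1 column to logical is shown on a hypothetical vector.

```r
# Toy LoE_DI values, including the NA and "" entries that get filtered out.
loe_raw <- c("Master's", "", "Secondary", NA, "Doctorate", "Master's",
             "Less than Secondary", "Bachelor's")

# Assumed ordering of the education levels, lowest to highest.
loe_levels <- c("Less than Secondary", "Secondary", "Bachelor's",
                "Master's", "Doctorate")

# Drop NA and blank entries, then convert to an ordered factor.
keep <- !is.na(loe_raw) & loe_raw != ""
loe  <- factor(loe_raw[keep], levels = loe_levels, ordered = TRUE)

# Counts per level, reported in level order.
loe_counts <- table(loe)

# A 0/1 indicator such as certified can be converted to logical the same way.
certified_toy <- as.logical(c(1, 0, 1))
```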
gender of all registrants, with NA and blank (“”) filtered:
We can see that these courses are dominated by male registrants.
Next, we investigate the activities of the registrants using the feature nevents.
nevents of all registrants:
In this graph, we set the limit of the x-axis to 10000 for clarity, since data entries with nevents>10000 are very rare. We can see that most of the registrants have nevents below 1000.
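A histogram restricted to the plotted range, with an explicitly chosen bin width, can be sketched as below. The sketch uses base R `hist` rather than the plotting code of the original analysis, and the exponential toy data, the bin width of 250 and the variable names are assumptions made for illustration.

```r
set.seed(1)
# Toy, heavily right-skewed stand-in for nevents (not the real data).
nevents <- round(rexp(5000, rate = 1 / 500))

# Keep only values inside the plotted range, mirroring the x-axis limit.
x_max    <- 10000
binwidth <- 250
shown    <- nevents[!is.na(nevents) & nevents <= x_max]

# Explicit breaks instead of a default number of bins.
breaks <- seq(0, x_max, by = binwidth)
h <- hist(shown, breaks = breaks, plot = FALSE)

# Share of the toy registrants with fewer than 1000 events.
share_below_1000 <- mean(shown < 1000)
```

Calling `plot(h)` draws the histogram; `h$counts` holds the number of entries per bin.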
nevents of registrants that passed a course (certified==1)
Within the group of certified==1, nevents is more evenly distributed between 0 and 10000.
The figure shows that most of the courses are dominated by male registrants.
The certified registrants of most courses are also predominantly male, except in Poverty and HealthStat.
LoE_DI, by course:
The above figure shows that the populations of registrants with a Bachelor’s or Secondary degree are relatively small in all courses. Note that the x-axis is in log scale.
nevents of all registrants, by course
The distributions of most of the courses are very similar. All the curves drop very sharply over the first 100 nevents. After that, the decrease becomes more moderate.
nevents of all registrants with certificates, by course
The distribution of nevents becomes more Gaussian if we only consider the population of those with certificates. However, the peaks of these distributions vary from course to course.
nevents of registrants with certificates, by course
This plot is a different representation of the previous plot. The boxplot shows the median and variation of nevents more clearly. Although the median of nevents varies a lot from course to course, most courses have a median of a few thousand. The course EM has the highest median value of nevents, while the course CSH has the lowest.
In the next three graphs, we plot similar boxplots of ndays_act, access.period and access.rate instead of nevents.
ndays_act of registrants with certificates, by course
access.period of registrants with certificates, by course
access.rate of registrants with certificates, by course
We can see that ndays_act, access.period and access.rate all vary a lot across courses, as we observed in the boxplot of nevents.
We then pick two user-activity features, nevents and ndays_act, and explore their relation:
ndays_act against nevents among the “explored” registrants (explored==1):
From this plot, we can see that ndays_act and nevents are strongly correlated. Also, data entries with higher values of nevents and ndays_act have a higher chance of being certified.
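The strength of the relation between the two activity features can also be quantified with a correlation coefficient. The following is a sketch on simulated data in which nevents is constructed to grow with ndays_act; the simulated relationship and all variable values are assumptions and say nothing about the real dataset.

```r
set.seed(42)
n <- 300
# Toy data: nevents loosely proportional to ndays_act (invented relationship).
ndays_act <- rpois(n, lambda = 15) + 1
nevents   <- round(ndays_act * runif(n, 50, 150))
explored  <- runif(n) < 0.7

d  <- data.frame(ndays_act, nevents, explored)
ex <- d[d$explored, ]  # restrict to the "explored" registrants

# Pearson measures linear association, Spearman measures monotone association.
pearson  <- cor(ex$ndays_act, ex$nevents)
spearman <- cor(ex$ndays_act, ex$nevents, method = "spearman")
```

Spearman's rank correlation is less sensitive to the long right tail that count variables such as nevents tend to have.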
In the last part of this section, we explore some other features.
total_nforum_posts against pass.rate of each course
total_nforum_posts might be an indicator of the support offered by the community, so one may assume that a large total_nforum_posts helps the pass.rate. However, we did not see such a relation in the above plot.
In every course except Poverty (The Challenges of Global Poverty), registrants with certificates are predominantly male.
Registrants with certificates are largely dominated by those who do not have a Secondary school degree (Less than Secondary) or those who hold an advanced degree (Master’s and Doctorate). This trend exists in every course in this dataset.
The distribution of nevents is approximately exponential among all registrants, but becomes more Gaussian when only certified registrants are included.
The variation of nevents among certified users of each course is large. The average nevents of each course is typically a few thousand.
The course EM (Electricity and Magnetism) has the highest mean values of nevents, access.rate and ndays_act across all courses, which suggests that it may be the most demanding of these 16 courses.
total_nforum_posts is not strongly correlated with pass.rate, indicating that the support from the community in the forum is not an essential factor for pass.rate.
Most users have about 100 nevents per ndays_act, as shown in the figure of nevents against ndays_act. The figure also shows that registrants with more nevents and ndays_act have a higher chance of passing the course.
Within these 16 courses, registrants who passed are predominantly male. Also, the certified students are dominated by those who do not have a Secondary school degree (Less than Secondary) or who hold an advanced degree (Master’s and Doctorate).
In this section, we will explore the activities of users versus grade.
grade of each course
The distribution of grade is different from course to course, but almost all courses have a peak near zero, as expected. Note that grade of the course CS50X is not available.
nevents against grade, by course
access.rate against grade, by course
The above figures show that the correlations between grade and nevents and between grade and access.rate are not very strong.
From these figures, we found that the correlations between grade and nevents and between grade and access.rate are not very strong. However, courses such as EM or CSM13 do show weak trends in which a higher access.rate or nevents goes with a higher grade.
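A per-course correlation coefficient is one way to put numbers on how weak or strong these trends are. The sketch below computes Spearman's correlation between grade and nevents separately for each course on simulated data; the two hypothetical courses "X" and "Y" and their built-in relationships are invented for illustration.

```r
set.seed(7)
# Toy data for two hypothetical courses with different grade/nevents coupling.
toy <- data.frame(
  course_id = rep(c("X", "Y"), each = 200),
  nevents   = round(rexp(400, rate = 1 / 2000))
)
noise <- rnorm(400, sd = 0.2)
toy$grade <- ifelse(
  toy$course_id == "X",
  pmin(1, pmax(0, toy$nevents / 10000 + noise)),     # weaker link
  pmin(1, pmax(0, toy$nevents / 4000 + noise / 4))   # stronger link
)

# Spearman correlation per course; rows with a missing value are dropped.
cor_by_course <- sapply(split(toy, toy$course_id), function(d) {
  cor(d$grade, d$nevents, method = "spearman", use = "complete.obs")
})
```

A course whose grade is entirely missing has no complete pairs to correlate and would need to be excluded before this step.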
The distribution of grade behaves differently from course to course. Two different patterns are commonly seen:
M-shape: Two main peaks exist in the distribution, one in the region below the pass grade and the other in the region above the pass grade. Biology, CSM12, CSM13 and Poverty have an M-shaped distribution.
U-shape: Most of the population accumulates at both ends of the range. The distributions of Circuits12, Circuits13 and JusticeX have this characteristic.
This figure is the density plot of the education level (LoE_DI) of the registrants of each course, with and without certificates. TRUE marks the registrants with certificates, whereas FALSE marks the registrants without certificates. Data entries with LoE_DI equal to NA or "" are ignored. As shown in this figure, the registrants of each course are dominated by participants who hold an advanced degree or less than a Secondary degree. The populations of users who hold a Secondary or Bachelor’s degree are small in every course. In addition, this plot shows that the composition of education levels does not vary much between the populations with and without certificates.
This figure shows a box plot and a scatter plot of nevents of each certified participant, by course. The lower, middle and upper hinges of the boxplot represent the 25th, 50th and 75th percentiles of nevents, respectively. nevents records the number of interactions with the course and can therefore be an indication of the effort required to pass a course. The boxplot helps to compare the median and spread of nevents across courses. For example, the median and variation of nevents of the course CSH are very small compared to other courses. On the other hand, the course EM has the highest median nevents among the courses, indicating that this course needs more effort to complete.
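The three hinges of each box can also be computed directly. The sketch below uses base R's `fivenum`, which returns Tukey's five-number summary; its second to fourth elements are the lower hinge, the median and the upper hinge, which coincide with or lie close to the 25th, 50th and 75th percentiles. The values and the two course labels "low" and "high" are made up.

```r
# Toy nevents values for certified users in two hypothetical courses.
nevents   <- c(100, 200, 300, 400, 500, 1000, 3000, 5000, 7000, 9000)
course_id <- rep(c("low", "high"), each = 5)

# Lower hinge, median and upper hinge per course
# (elements 2 to 4 of the five-number summary).
hinges <- sapply(split(nevents, course_id), function(x) fivenum(x)[2:4])
rownames(hinges) <- c("lower", "median", "upper")
```

The result is a matrix with one column per course, so `hinges["median", ]` compares the medians across courses at a glance.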
The figure shows grade against nevents for each course among “explored” registrants (explored==1). Different colors in the plot separate the registrants who earned a certificate from those who did not. This set of figures shows that the correlation between grade and nevents is weak. However, courses such as CSM13 or EM show a slightly stronger correlation between grade and nevents.
In this study we analyzed which background information and activities of users affect their performance in a MOOC, and how. In most of the analysis, we choose the “explored” registrants (explored==1) as the sample space, because including all the registrants of each course may bias the results: most of the registrants are not very involved in the courses. However, it is worth noting that a few percent of certified registrants are not “explored” registrants. These extreme cases are neglected in our analyses. In addition, we choose to use certified and grade to gauge the performance of the registrants, but different grading policies and certificate requirements make it difficult to find universal trends across courses.
Through these data visualizations, we found some significant trends in gender and level of education among the registrants who earned certificates. However, we struggled to find strong correlations between performance and features such as nevents, ndays_act or access.rate, because the participants have different backgrounds and different goals in exploring a course, which is also part of the appeal of MOOCs.
It is possible to explore this dataset further by adopting statistical learning methods such as regressions or decision trees to predict certified or grade. The correlations between users’ countries (final_cc_cname_DI) and other features would also be interesting to look into.
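As one possible starting point for such a follow-up, a logistic regression for certified could look like the sketch below. It is fitted on simulated data in which the probability of certification is constructed to rise with ndays_act; the simulated relationship, the choice of predictors and the `log1p` transform of nevents are assumptions for illustration, not results from this dataset.

```r
set.seed(123)
n <- 500
# Toy data: probability of certification rises with activity (invented).
ndays_act <- rpois(n, 10)
nevents   <- round(ndays_act * runif(n, 50, 150))
p         <- plogis(-4 + 0.3 * ndays_act)
certified <- rbinom(n, size = 1, prob = p)
toy <- data.frame(certified, nevents, ndays_act)

# Logistic regression; log1p tames the right skew of nevents.
fit <- glm(certified ~ log1p(nevents) + ndays_act,
           data = toy, family = binomial)

# Predicted certification probabilities for the toy users.
pred <- predict(fit, type = "response")
```

`summary(fit)` reports the estimated coefficients; on real data, a held-out test set would be needed to judge predictive performance.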